Combining Text Mining and Sequence Analysis to Discover Protein Functional Regions

Authors

  • Eleazar Eskin
  • Eugene Agichtein
Abstract

Recently presented protein sequence classification models can identify the regions of a sequence that are relevant to the classification. This observation has many potential applications in detecting functional regions of proteins. However, identifying such sequence regions automatically is difficult in practice, because relatively few types of information have enough annotated sequences to support this analysis. Our approach addresses this data scarcity problem by combining text and sequence analysis. First, we train a text classifier over the explicit textual annotations available for some of the sequences in the dataset, and use the trained classifier to predict the class of the remaining, unlabeled sequences. We then train a joint sequence-text classifier over the text contained in the functional annotations of the sequences and the actual sequences in this larger, automatically extended dataset. Finally, we project the classifier onto the original sequences to determine the relevant regions of the sequences. We demonstrate the effectiveness of our approach by predicting protein sub-cellular localization and determining localization-specific functional regions of these proteins.
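
The four steps in the abstract can be sketched in a few lines of Python. This is a toy illustration only, not the authors' implementation: the abstract does not say which classifiers or features were used, so a simple perceptron over bag-of-words and 3-mer counts stands in for both models. The sequences, annotation texts, and all function names (`train_perceptron`, `project`, and so on) are invented for the example.

```python
from collections import Counter, defaultdict

def kmers(seq, k=3):
    """Return (start position, k-mer) pairs for a sequence."""
    return [(i, seq[i:i + k]) for i in range(len(seq) - k + 1)]

def text_feats(text):
    """Bag-of-words features, prefixed so they cannot collide with k-mers."""
    return Counter("t:" + tok for tok in text.lower().split())

def seq_feats(seq, k=3):
    """k-mer count features for a protein sequence."""
    return Counter("s:" + km for _, km in kmers(seq, k))

def score(w, feats):
    """Linear score of a feature vector under weights w."""
    return sum(w.get(f, 0.0) * v for f, v in feats.items())

def train_perceptron(examples, epochs=20):
    """Stand-in linear classifier (assumption): examples are (Counter, +1/-1)."""
    w = defaultdict(float)
    for _ in range(epochs):
        for feats, y in examples:
            if y * score(w, feats) <= 0:          # misclassified: update
                for f, v in feats.items():
                    w[f] += y * v
    return w

def project(w, seq, k=3):
    """Per-residue relevance: sum the weights of all k-mers covering a position."""
    rel = [0.0] * len(seq)
    for i, km in kmers(seq, k):
        for j in range(i, i + k):
            rel[j] += w.get("s:" + km, 0.0)
    return rel

# Hypothetical toy data: (sequence, annotation text, class), +1 = nuclear, -1 = membrane.
labeled = [
    ("MKKLLAAAGG", "nuclear transcription factor", +1),
    ("GGSSPKKKRKV", "nuclear dna binding", +1),
    ("MLLSSVVWWT", "membrane transport channel", -1),
    ("WWTLLVVIIS", "membrane receptor", -1),
]
unlabeled = [
    ("AAPKKKRKVGG", "dna binding transcription"),
    ("IIWWTLLSSV", "transport receptor"),
]

# Step 1: train a text classifier on the explicitly labeled annotations.
text_w = train_perceptron([(text_feats(t), y) for _, t, y in labeled])

# Step 2: predict classes for the unlabeled sequences to extend the dataset.
extended = list(labeled) + [
    (s, t, 1 if score(text_w, text_feats(t)) > 0 else -1) for s, t in unlabeled
]

# Step 3: train a joint classifier over text and sequence features.
joint_w = train_perceptron(
    [(text_feats(t) + seq_feats(s), y) for s, t, y in extended]
)

# Step 4: project the sequence part of the classifier onto a sequence.
relevance = project(joint_w, "IIWWTLLSSV")
print(list(zip("IIWWTLLSSV", relevance)))
```

In this sketch a strongly negative score marks residues covered by k-mers that the joint model associates with the -1 (membrane) class, and a strongly positive score would mark evidence for the +1 class. Only the sequence weights are projected; the text features help training but have no position in the sequence.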


Similar resources

A review of text mining approaches and their function in discovering and extracting a topic

Background and aim: Four text mining methods are examined and focused on understanding and identifying their properties and limitations in subject discovery. Methodology: The study is an analytical review of the literature of text mining and topic modeling.  Findings: LSA could be used to classify specific and unique topics in documents that address only a single topic. The other three text min...


WildSpan: Efficient Discovery of Functional Motifs Spanning Large Wildcard Regions from Protein Sequences

Motivation: Automatic extraction of motifs from biological sequences is an important problem in molecular biology. For proteins, it is desired to discover sequence motifs containing large irregular gaps as the contact residues associated with a functional site are not always from one region of the sequences. Discovering such patterns is a time-consuming task due to a large number of combination...


Mining Conserved Local Structure from Functional Hierarchical Classification via Local Structure Comparison

Local region conservation has been observed in recent years and become more and more important in structure biology. Recent researches point out that local conservation regions are correlated to protein functional sites and functions and studies show that some local conservation on sequence or structure are close to binding area. Hence, in order to realize how function works, we can discover lo...


Combining Biological Databases and Text Mining to Support New Bioinformatics Applications

A large amount of biological knowledge today is only available from full-text research papers. Since neither manual database curators nor users can keep up with the rapidly expanding volume of scientific literature, natural language processing approaches are becoming increasingly important for bioinformatic projects. In this paper, we go beyond simply extracting information from fulltext articl...


MAGIIC-PRO: detecting functional signatures by efficient discovery of long patterns in protein sequences

This paper presents a web service named MAGIIC-PRO, which aims to discover functional signatures of a query protein by sequential pattern mining. Automatic discovery of patterns from unaligned biological sequences is an important problem in molecular biology. MAGIIC-PRO is different from several previously established methods performing similar tasks in two major ways. The first remarkable feat...



Journal:
  • Pacific Symposium on Biocomputing

Publication date: 2004